Maximising the Size of Non-Redundant Protein Datasets Using Graph Theory
Abstract
Analysis of protein data sets often requires prior removal of redundancy, so that results are not biased by the presence of similar proteins. This is usually achieved by pairwise comparison of sequences, followed by purging so that no two sequences have a similarity above a chosen threshold. From a starting set, such as the PDB or a genome, one should remove as few sequences as possible, to give the largest possible non-redundant set for subsequent analysis. Protein redundancy can be represented as a graph, with proteins as nodes that are connected by an undirected edge if they have a pairwise similarity above the chosen threshold. The problem is then equivalent to finding the maximum independent set (MIS), that is, removing as few nodes as possible so that no edges remain. We tested seven MIS algorithms, three of which are new. We applied the methods to the PDB, subsets of the PDB, various genomes and the BHOLSIB benchmark datasets. For PDB subsets of up to 1000 proteins, we could compare to the exact MIS, found by the Cliquer algorithm. The best algorithm was the new method, Leaf. This works by adding to the MIS those clique members that have no edges to nodes outside the clique, starting with the smallest cliques. For PDB subsets of up to 1000 members, it usually finds the MIS, and it is fast enough to apply to data sets of tens of thousands of proteins. Leaf gives sets that are around 10% larger than those produced by the commonly used PISCES algorithm, and that are of identical quality. We therefore suggest that Leaf should be the method of choice for generating non-redundant protein data sets, though it is ineffective on dense graphs, such as the BHOLSIB benchmarks. The Leaf algorithm is available at: https://github.com/SimonCB765/Leaf, and sets from genomes and the PDB are available at: http://www.bioinf.manchester.ac.uk/leaf/.
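The graph formulation in the abstract can be illustrated with a small Python sketch. It is not the Leaf, PISCES or Cliquer algorithm; it uses a generic minimum-degree greedy heuristic as a stand-in, only to show how a redundancy graph is built from pairwise similarities and how an independent set is read off it. The protein identifiers, similarity values, threshold and all function names are hypothetical and chosen for illustration.

```python
from itertools import combinations


def build_redundancy_graph(ids, similarity, threshold):
    """Return an adjacency dict: an edge joins two proteins whose
    pairwise similarity is above the threshold."""
    adj = {p: set() for p in ids}
    for a, b in combinations(ids, 2):
        if similarity(a, b) > threshold:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def greedy_independent_set(adj):
    """Generic heuristic (NOT Leaf): repeatedly keep the node with the
    fewest remaining neighbours, then discard it and its neighbours."""
    remaining = {n: set(nb) for n, nb in adj.items()}
    kept = set()
    while remaining:
        # Ties are broken by identifier so the result is deterministic.
        node = min(remaining, key=lambda n: (len(remaining[n]), n))
        kept.add(node)
        removed = remaining[node] | {node}
        for r in removed:
            del remaining[r]
        for nb in remaining.values():
            nb -= removed
    return kept


def is_independent(adj, nodes):
    """True if no two nodes in the set share an edge."""
    return all(adj[n].isdisjoint(nodes) for n in nodes)


# Hypothetical toy data: protein B is similar to A, C and D.
toy_scores = {
    frozenset(("A", "B")): 0.9,
    frozenset(("B", "C")): 0.8,
    frozenset(("B", "D")): 0.7,
}


def toy_similarity(a, b):
    # Pairs that are not listed are treated as dissimilar.
    return toy_scores.get(frozenset((a, b)), 0.0)


graph = build_redundancy_graph(["A", "B", "C", "D"], toy_similarity, 0.3)
kept = greedy_independent_set(graph)
print(sorted(kept))  # ['A', 'C', 'D']
```

In this toy case, discarding the single hub protein B leaves three mutually dissimilar proteins, whereas keeping B would force A, C and D to be removed and leave a non-redundant set of size one.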
Similar resources
Analysis of the enzyme network involved in cattle milk production using graph theory
Understanding cattle metabolism and its relationship with milk products is important in bovine breeding. A systemic view could lead to consequences that will result in a better understanding of existing concepts. Topological indices and quantitative characterizations mostly result from the application of graph theory on biological data. In the present work, the enzyme network involved in cattle...
Estimating the Location of Protein-Coding Regions in Numerical DNA Sequences Using a Variable-Length Window Based on the Three-Dimensional Z-Curve
In recent years, estimation of protein-coding regions in numerical deoxyribonucleic acid (DNA) sequences using signal processing tools has been a challenging issue in bioinformatics, owing to their 3-base periodicity. Several digital signal processing (DSP) tools have been applied in order to identify the task and concentrated on assigning numerical values to the symbolic DNA sequence, then app...
A New Hybrid Framework for Filter based Feature Selection using Information Gain and Symmetric Uncertainty (TECHNICAL NOTE)
Feature selection is a pre-processing technique used for eliminating the irrelevant and redundant features which results in enhancing the performance of the classifiers. When a dataset contains more irrelevant and redundant features, it fails to increase the accuracy and also reduces the performance of the classifiers. To avoid them, this paper presents a new hybrid feature selection method usi...
Application of Graph Theory: Investigation of Relationship Between Boiling Temperatures of Olefins and Topological Indices
Abstract: In this study an appropriate computational approach was presented for estimating the boiling temperatures of 41 different types of olefins and their derivatives. Based on the guidelines of this approach, several structural indices related to the organic components were applied using graph theory. Meanwhile, in addition to evaluating the relation between the boiling temperatures of ole...
Inverse Kinematics Resolution of Redundant Cooperative Manipulators Using Optimal Control Theory
The optimal path planning of cooperative manipulators is studied in the present research. Optimal Control Theory is employed to calculate the optimal path of each joint choosing an appropriate index of the system to be minimized and taking the kinematics equations as the constraints. The formulation has been derived using Pontryagin Minimum Principle and results in a Two Point Boundary Value Pr...